Prosper Loan Exploration by Uday Goel

This report explores a dataset containing loan attributes and respective borrower details for approximately 114,000 loans

Univariate Plots Section

Below are the dimensions and the variables for the data set

## [1] 113937     81
##  [1] "ListingKey"                         
##  [2] "ListingNumber"                      
##  [3] "ListingCreationDate"                
##  [4] "CreditGrade"                        
##  [5] "Term"                               
##  [6] "LoanStatus"                         
##  [7] "ClosedDate"                         
##  [8] "BorrowerAPR"                        
##  [9] "BorrowerRate"                       
## [10] "LenderYield"                        
## [11] "EstimatedEffectiveYield"            
## [12] "EstimatedLoss"                      
## [13] "EstimatedReturn"                    
## [14] "ProsperRating..numeric."            
## [15] "ProsperRating..Alpha."              
## [16] "ProsperScore"                       
## [17] "ListingCategory..numeric."          
## [18] "BorrowerState"                      
## [19] "Occupation"                         
## [20] "EmploymentStatus"                   
## [21] "EmploymentStatusDuration"           
## [22] "IsBorrowerHomeowner"                
## [23] "CurrentlyInGroup"                   
## [24] "GroupKey"                           
## [25] "DateCreditPulled"                   
## [26] "CreditScoreRangeLower"              
## [27] "CreditScoreRangeUpper"              
## [28] "FirstRecordedCreditLine"            
## [29] "CurrentCreditLines"                 
## [30] "OpenCreditLines"                    
## [31] "TotalCreditLinespast7years"         
## [32] "OpenRevolvingAccounts"              
## [33] "OpenRevolvingMonthlyPayment"        
## [34] "InquiriesLast6Months"               
## [35] "TotalInquiries"                     
## [36] "CurrentDelinquencies"               
## [37] "AmountDelinquent"                   
## [38] "DelinquenciesLast7Years"            
## [39] "PublicRecordsLast10Years"           
## [40] "PublicRecordsLast12Months"          
## [41] "RevolvingCreditBalance"             
## [42] "BankcardUtilization"                
## [43] "AvailableBankcardCredit"            
## [44] "TotalTrades"                        
## [45] "TradesNeverDelinquent..percentage." 
## [46] "TradesOpenedLast6Months"            
## [47] "DebtToIncomeRatio"                  
## [48] "IncomeRange"                        
## [49] "IncomeVerifiable"                   
## [50] "StatedMonthlyIncome"                
## [51] "LoanKey"                            
## [52] "TotalProsperLoans"                  
## [53] "TotalProsperPaymentsBilled"         
## [54] "OnTimeProsperPayments"              
## [55] "ProsperPaymentsLessThanOneMonthLate"
## [56] "ProsperPaymentsOneMonthPlusLate"    
## [57] "ProsperPrincipalBorrowed"           
## [58] "ProsperPrincipalOutstanding"        
## [59] "ScorexChangeAtTimeOfListing"        
## [60] "LoanCurrentDaysDelinquent"          
## [61] "LoanFirstDefaultedCycleNumber"      
## [62] "LoanMonthsSinceOrigination"         
## [63] "LoanNumber"                         
## [64] "LoanOriginalAmount"                 
## [65] "LoanOriginationDate"                
## [66] "LoanOriginationQuarter"             
## [67] "MemberKey"                          
## [68] "MonthlyLoanPayment"                 
## [69] "LP_CustomerPayments"                
## [70] "LP_CustomerPrincipalPayments"       
## [71] "LP_InterestandFees"                 
## [72] "LP_ServiceFees"                     
## [73] "LP_CollectionFees"                  
## [74] "LP_GrossPrincipalLoss"              
## [75] "LP_NetPrincipalLoss"                
## [76] "LP_NonPrincipalRecoverypayments"    
## [77] "PercentFunded"                      
## [78] "Recommendations"                    
## [79] "InvestmentFromFriendsCount"         
## [80] "InvestmentFromFriendsAmount"        
## [81] "Investors"

The dataset has 81 variables. I have selected 15 variables that I believe are unrelated and will provide distinct variables for analysis. After a quick glance at the dataset, I realise that the columns “CreditGrade” and “ProsperRating (Alpha)” both provide same information, with the only difference that “CreditGrade” is filled for loans listed pre-2009 and the other for post - 2009. I have checked that there are no loans with both fields filled but there are loans with no value in both fields.

## [1] "CreditGrade levels:"
## [1] ""   "A"  "AA" "B"  "C"  "D"  "E"  "HR" "NC"
## [1] "ProsperRating (Alpha) levels: "
## [1] ""   "A"  "AA" "B"  "C"  "D"  "E"  "HR"
## [1] "CreditGrade variable information:"
## 
##              Cancelled             Chargedoff              Completed 
##                      5                   6650                  18288 
##                Current              Defaulted FinalPaymentInProgress 
##                      0                   4010                      0 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##                      0                      0                      0 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##                      0                      0                      0
## [1] 28953    81
## [1] "ProsperRating(Alpha) variable information:"
## 
##              Cancelled             Chargedoff              Completed 
##                      0                   5336                  19664 
##                Current              Defaulted FinalPaymentInProgress 
##                  56576                   1005                    205 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##                     16                    806                    265 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##                    363                    313                    304
## [1] 84853    81
## [1] "Dimensions of data set with data in neither variable:"
## [1] 131  81
## [1] "Dimensions of data set with data in both variables:"
## [1]  0 81

I wish to include the loan credit rating in the analysis and so I have created a new column called “LoanRating” which is a combination of the values from CreditGrade and ProsperRating..Alpha. The 131 blank values are converted to NA and the variable is converted to a factor variable. This new column is then included in my list of shortlisted columns for further analysis. I also notice that there are columns for the date and quarter of loan origination. I will create a new column for the year of loan origination as that will be more useful to see if the relationships have changed over time.

## [1] "Table for new variable Year:"
## 
##  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014 
##    22  5906 11460 11552  2047  5652 11228 19553 34345 12172

I have the necessary variables now to create a subset of the data set.

## [1] "Dimensions of the new data set: "
## [1] 113937     17
## [1] "Structure of the new data set: "
## 'data.frame':    113937 obs. of  17 variables:
##  $ LoanRating            : Factor w/ 8 levels "AA","A","B","C",..: 4 2 7 2 5 3 6 4 1 1 ...
##  $ Term                  : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ BorrowerAPR           : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ EstimatedLoss         : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ ProsperScore          : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ BorrowerState         : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation            : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ CreditScoreRangeLower : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CurrentCreditLines    : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ TotalInquiries        : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ RevolvingCreditBalance: num  0 3989 NA 1444 6193 ...
##  $ DebtToIncomeRatio     : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange           : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ StatedMonthlyIncome   : num  3083 6125 2083 2875 9583 ...
##  $ LoanOriginalAmount    : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ Year                  : chr  "2007" "2014" "2007" "2012" ...
##  $ Quarter               : chr  "Q3" "Q1" "Q1" "Q4" ...
## [1] "Summary of the new data set: "
##    LoanRating         Term        BorrowerAPR      EstimatedLoss  
##  C      :23994   Min.   :12.00   Min.   :0.00653   Min.   :0.005  
##  B      :19970   1st Qu.:36.00   1st Qu.:0.15629   1st Qu.:0.042  
##  D      :19427   Median :36.00   Median :0.20976   Median :0.072  
##  A      :17866   Mean   :40.83   Mean   :0.21883   Mean   :0.080  
##  E      :13084   3rd Qu.:36.00   3rd Qu.:0.28381   3rd Qu.:0.112  
##  (Other):19465   Max.   :60.00   Max.   :0.51229   Max.   :0.366  
##  NA's   :  131                   NA's   :25        NA's   :29084  
##   ProsperScore   BorrowerState                      Occupation   
##  Min.   : 1.00   CA     :14717   Other                   :28617  
##  1st Qu.: 4.00   TX     : 6842   Professional            :13628  
##  Median : 6.00   NY     : 6729   Computer Programmer     : 4478  
##  Mean   : 5.95   FL     : 6720   Executive               : 4311  
##  3rd Qu.: 8.00   IL     : 5921   Teacher                 : 3759  
##  Max.   :11.00          : 5515   Administrative Assistant: 3688  
##  NA's   :29084   (Other):67493   (Other)                 :55456  
##  CreditScoreRangeLower CurrentCreditLines TotalInquiries   
##  Min.   :  0.0         Min.   : 0.00      Min.   :  0.000  
##  1st Qu.:660.0         1st Qu.: 7.00      1st Qu.:  2.000  
##  Median :680.0         Median :10.00      Median :  4.000  
##  Mean   :685.6         Mean   :10.32      Mean   :  5.584  
##  3rd Qu.:720.0         3rd Qu.:13.00      3rd Qu.:  7.000  
##  Max.   :880.0         Max.   :59.00      Max.   :379.000  
##  NA's   :591           NA's   :7604       NA's   :1159     
##  RevolvingCreditBalance DebtToIncomeRatio         IncomeRange   
##  Min.   :      0        Min.   : 0.000    $25,000-49,999:32192  
##  1st Qu.:   3121        1st Qu.: 0.140    $50,000-74,999:31050  
##  Median :   8549        Median : 0.220    $100,000+     :17337  
##  Mean   :  17599        Mean   : 0.276    $75,000-99,999:16916  
##  3rd Qu.:  19521        3rd Qu.: 0.320    Not displayed : 7741  
##  Max.   :1435667        Max.   :10.010    $1-24,999     : 7274  
##  NA's   :7604           NA's   :8554      (Other)       : 1427  
##  StatedMonthlyIncome LoanOriginalAmount     Year          
##  Min.   :      0     Min.   : 1000      Length:113937     
##  1st Qu.:   3200     1st Qu.: 4000      Class :character  
##  Median :   4667     Median : 6500      Mode  :character  
##  Mean   :   5608     Mean   : 8337                        
##  3rd Qu.:   6825     3rd Qu.:12000                        
##  Max.   :1750003     Max.   :35000                        
##                                                           
##    Quarter         
##  Length:113937     
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

I notice that the variable Term has the same value for Median, 1st Quartile and 3rd Quartile. I wonder if there are only a handful of unique values and whether it is better to convert it to a factor variable. On analysis I find that this has only 3 unique values and so it does make sense to convert to a factor variable and rerun the structure and summary analysis

## [1] "Unique values of Term variable: "
## [1] 36 60 12
## [1] "Structure of the data set after creating factor variables: "
## 'data.frame':    113937 obs. of  17 variables:
##  $ LoanRating            : Factor w/ 8 levels "AA","A","B","C",..: 4 2 7 2 5 3 6 4 1 1 ...
##  $ Term                  : Factor w/ 3 levels "12","36","60": 2 2 2 2 2 3 2 2 2 2 ...
##  $ BorrowerAPR           : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ EstimatedLoss         : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ ProsperScore          : Factor w/ 11 levels "1","2","3","4",..: NA 7 NA 9 4 10 2 4 9 11 ...
##  $ BorrowerState         : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation            : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ CreditScoreRangeLower : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CurrentCreditLines    : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ TotalInquiries        : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ RevolvingCreditBalance: num  0 3989 NA 1444 6193 ...
##  $ DebtToIncomeRatio     : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange           : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ StatedMonthlyIncome   : num  3083 6125 2083 2875 9583 ...
##  $ LoanOriginalAmount    : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ Year                  : chr  "2007" "2014" "2007" "2012" ...
##  $ Quarter               : chr  "Q3" "Q1" "Q1" "Q4" ...
## [1] "Summary of the data set: "
##    LoanRating    Term        BorrowerAPR      EstimatedLoss  
##  C      :23994   12: 1614   Min.   :0.00653   Min.   :0.005  
##  B      :19970   36:87778   1st Qu.:0.15629   1st Qu.:0.042  
##  D      :19427   60:24545   Median :0.20976   Median :0.072  
##  A      :17866              Mean   :0.21883   Mean   :0.080  
##  E      :13084              3rd Qu.:0.28381   3rd Qu.:0.112  
##  (Other):19465              Max.   :0.51229   Max.   :0.366  
##  NA's   :  131              NA's   :25        NA's   :29084  
##   ProsperScore   BorrowerState                      Occupation   
##  4      :12595   CA     :14717   Other                   :28617  
##  6      :12278   TX     : 6842   Professional            :13628  
##  8      :12053   NY     : 6729   Computer Programmer     : 4478  
##  7      :10597   FL     : 6720   Executive               : 4311  
##  5      : 9813   IL     : 5921   Teacher                 : 3759  
##  (Other):27517          : 5515   Administrative Assistant: 3688  
##  NA's   :29084   (Other):67493   (Other)                 :55456  
##  CreditScoreRangeLower CurrentCreditLines TotalInquiries   
##  Min.   :  0.0         Min.   : 0.00      Min.   :  0.000  
##  1st Qu.:660.0         1st Qu.: 7.00      1st Qu.:  2.000  
##  Median :680.0         Median :10.00      Median :  4.000  
##  Mean   :685.6         Mean   :10.32      Mean   :  5.584  
##  3rd Qu.:720.0         3rd Qu.:13.00      3rd Qu.:  7.000  
##  Max.   :880.0         Max.   :59.00      Max.   :379.000  
##  NA's   :591           NA's   :7604       NA's   :1159     
##  RevolvingCreditBalance DebtToIncomeRatio         IncomeRange   
##  Min.   :      0        Min.   : 0.000    $25,000-49,999:32192  
##  1st Qu.:   3121        1st Qu.: 0.140    $50,000-74,999:31050  
##  Median :   8549        Median : 0.220    $100,000+     :17337  
##  Mean   :  17599        Mean   : 0.276    $75,000-99,999:16916  
##  3rd Qu.:  19521        3rd Qu.: 0.320    Not displayed : 7741  
##  Max.   :1435667        Max.   :10.010    $1-24,999     : 7274  
##  NA's   :7604           NA's   :8554      (Other)       : 1427  
##  StatedMonthlyIncome LoanOriginalAmount     Year          
##  Min.   :      0     Min.   : 1000      Length:113937     
##  1st Qu.:   3200     1st Qu.: 4000      Class :character  
##  Median :   4667     Median : 6500      Mode  :character  
##  Mean   :   5608     Mean   : 8337                        
##  3rd Qu.:   6825     3rd Qu.:12000                        
##  Max.   :1750003     Max.   :35000                        
##                                                           
##    Quarter         
##  Length:113937     
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

The dataset now contains 17 variables with almost 114,000 observations. Next I will run series of univariate plots on these variables

The loan rating of C has the maximum number of borrowers. The graph is a pyramid type structure with AA (the highest rating) has the least number of borrowers, as expected.

Most of the loans have been issued for 36 months. There are hardly any loans for 12 months and none for terms of 2yrs or 4 yrs. I wonder why this is so but potentially this could be due to a non-random sampling of data or due to commercial reasons. We are unable to investigate that in this exercise.

Most of the loans seem to have originated after 2009. There is a sudden dip in the originated loans after 2008. I assume this may be due to the financial crisis but since then there has been a steady growth in loan origination. I see reduction in 2014 which can happen if the sample data does not cover whole of 2014.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00653 0.15629 0.20976 0.21883 0.28381 0.51229      25

The BorrowerAPR represents the interest rate being charged to the borrower. The data is in decimals but since this is a percentage value, I have multiplied by 100 to make it easier to read and interpret. Most of the borrowers are being charged the rate between 15% and 30% and the maximum rate being charged is around 51% which is roughly 2.5 times the median. On reducing the binwidth, I see a spike in number of borrowers being charged between 35% and 37%. Visually it looks as if there are 3 pyramid structures - 6% to around 12%, 12% to 33% and 33% to 51%. I wonder if this is due to different terms of the loans as the area under each pyramid seems to be in line with the count values from histogram of the Term variable.

The above are the the BorrowerAPR histograms split by Term. The BorrowerAPR is spread across a wide range for each term and so this confirms that if there are 3 distinct pyramids, then they are not due to the 3 terms.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.005   0.042   0.072   0.080   0.112   0.366   29084

The EstimatedLoss for most loans seems to be within 0.112 but there are some outliers with the maximum value being 0.366. There seem to be a lot of NA which is surprising as we would expect each loan to have an estimate for expected loss value. To identify the reason, I have run a test of the status of the loans that have the EstimatedLoss as NA. I run this on the original loans dataset and not the subset of 17 variables as LoanStatus is not included in the subset.

## [1] "LoanStatus table where EstimatedLoss is NA"
## 
##              Cancelled             Chargedoff              Completed 
##                      5                   6656                  18410 
##                Current              Defaulted FinalPaymentInProgress 
##                      0                   4013                      0 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##                      0                      0                      0 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##                      0                      0                      0

The results show that all such loans are either Cancelled or ChargedOff or Completed or Defaulted. This seems that the loans are no longer active and so the expected loss from them in future is not defined.

##     1     2     3     4     5     6     7     8     9    10    11  NA's 
##   992  5766  7642 12595  9813 12278 10597 12053  6911  4750  1456 29084

Most of the loans have the ProsperScore between 4 and 8 and the structure is similar to the LoanRating histogram.

##          AK    AL    AR    AZ    CA    CO    CT    DC    DE    FL    GA 
##  5515   200  1679   855  1901 14717  2210  1627   382   300  6720  5008 
##    HI    IA    ID    IL    IN    KS    KY    LA    MA    MD    ME    MI 
##   409   186   599  5921  2078  1062   983   954  2242  2821   101  3593 
##    MN    MO    MS    MT    NC    ND    NE    NH    NJ    NM    NV    NY 
##  2318  2615   787   330  3084    52   674   551  3097   472  1090  6729 
##    OH    OK    OR    PA    RI    SC    SD    TN    TX    UT    VA    VT 
##  4197   971  1817  2972   435  1122   189  1737  6842   877  3278   207 
##    WA    WI    WV    WY 
##  3048  1842   391   150

Roughly 15,000 of the borrowers are from the State of CA which is more than 10% of the dataset. At the same time, there are several states such as ME, AK, IA, ND and WY that have less than or equal to 200 borrowers each. I wonder why the distribution is so skewed and whether it is linked to the population distribution across the States or is there some other factor(s).

##                                                        Accountant/CPA 
##                               3588                               3233 
##           Administrative Assistant                            Analyst 
##                               3688                               3602 
##                          Architect                           Attorney 
##                                213                               1046 
##                          Biologist                         Bus Driver 
##                                125                                316 
##                         Car Dealer                            Chemist 
##                                180                                145 
##                      Civil Service                             Clergy 
##                               1457                                196 
##                           Clerical                Computer Programmer 
##                               3164                               4478 
##                       Construction                            Dentist 
##                               1790                                 68 
##                             Doctor                Engineer - Chemical 
##                                494                                225 
##              Engineer - Electrical              Engineer - Mechanical 
##                               1125                               1406 
##                          Executive                            Fireman 
##                               4311                                422 
##                   Flight Attendant                       Food Service 
##                                123                               1123 
##            Food Service Management                          Homemaker 
##                               1239                                120 
##                           Investor                              Judge 
##                                214                                 22 
##                            Laborer                        Landscaping 
##                               1595                                236 
##                 Medical Technician                  Military Enlisted 
##                               1117                               1272 
##                   Military Officer                        Nurse (LPN) 
##                                346                                492 
##                         Nurse (RN)                       Nurse's Aide 
##                               2489                                491 
##                              Other                         Pharmacist 
##                              28617                                257 
##         Pilot - Private/Commercial  Police Officer/Correction Officer 
##                                199                               1578 
##                     Postal Service                          Principal 
##                                627                                312 
##                       Professional                          Professor 
##                              13628                                557 
##                       Psychologist                            Realtor 
##                                145                                543 
##                          Religious                  Retail Management 
##                                124                               2602 
##                 Sales - Commission                     Sales - Retail 
##                               3446                               2797 
##                          Scientist                      Skilled Labor 
##                                372                               2746 
##                      Social Worker         Student - College Freshman 
##                                741                                 41 
## Student - College Graduate Student           Student - College Junior 
##                                245                                112 
##           Student - College Senior        Student - College Sophomore 
##                                188                                 69 
##        Student - Community College         Student - Technical School 
##                                 28                                 16 
##                            Teacher                     Teacher's Aide 
##                               3759                                276 
##              Tradesman - Carpenter            Tradesman - Electrician 
##                                120                                477 
##               Tradesman - Mechanic                Tradesman - Plumber 
##                                951                                102 
##                       Truck Driver                    Waiter/Waitress 
##                               1675                                436

The borrowers seem to be broadly spread across all occupations with couple of exceptions. The highest number is from “Other” and next from “Professional”. These don’t give much clarity into the occupation and so I wonder whether this field is useful for this analysis.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0   660.0   680.0   685.6   720.0   880.0     591
## [1] "Number of rows with CreditScoreRangeLower value of 0.0"
## [1] 133  17

Re-running the histogram to ignore the values of 0.0

Most of the values for CreditScoreRangeLower are between 660 and 720. The graph has a bigger left tail compared to the right.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    7.00   10.00   10.32   13.00   59.00    7604

Most of the borrowers have between 7 and 13 current credit lines. By definition, this is a minimum of 0. Though, some borrowers with 0 current credit lines have received loans, we should bear in mind that this data only covers people that had the loans accepted, and not the people who applied for loans. There is a possibility that only a small proportion of such people were accepted due to lack of prior credit history.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   2.000   4.000   5.584   7.000 379.000    1159
## [1] "Dimension of data set with TotalInquiries > 50: "
## [1] 202  17

Most people have less than or equal to 7 total inquiries. This dataset has a very long tail, with a maximum value of 379. There are 202 values above 50, out of a total dataset of around 114,000 values which is quite small. I wonder if this is due to the loan origination criteria set by Prosper or whether this is a reflection of the population in general.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0    3121    8549   17599   19521 1435667    7604
## [1] "Dimension of the data set with RevolvingCreditBalance > 50000: "
## [1] 7318   17

Most of the borrowers have RevolvingCreditBalance of less than 20,000. This value varies greatly across the borrowers with the maximum being around 1,435,000. In this dataset, over 7300 borrowers have this value above 50000. I wonder if this value in itself can directly influence the pricing of the loans as individuals with different income levels can have a different level of outstanding credit balance but the same proportional value, and by intuition, I would guess that the pricing of the loan for an amount in same proportion should be very similar.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8554
## [1] "Dimension of the data set with DebtToIncomeRatio > 1: "
## [1] 799  17

Most of the borrowers have DebtToIncomeRatio of less that 0.32. There are roughly 800 borrowers with the ratio of more than 1.0 and the maximum value is around 10.0. Similar to the charts above, this graph has a smooth peak and big tail.

##             $0      $1-24,999      $100,000+ $25,000-49,999 $50,000-74,999 
##            621           7274          17337          32192          31050 
## $75,000-99,999  Not displayed   Not employed 
##          16916           7741            806

More than half of the borrowers, roughly 63,000, have income level between $25,000 and $75,000. Another 35,000 borrowers have income level above $75,000. The dataset doesn’t tell us the maximum income level across all borrowers. This may make the analysis bit difficult for borrowers with high income if the actual variation in income above $100,000 is very large.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3200    4667    5608    6825 1750003
## [1] "Dimension of the data set with StatedMonthlyIncome > 10000: "
## [1] 9780   17

Most borrowers have a monthly income of less than $7,000 and the median is around 4,650. The maximum monthly income is $1,750,000 which is much bigger than the median. Roughly 9800 borrowers have monthly income above $10,000 which equates to $120,000 per annum. Comparing this to the IncomeRange data, I can imply that roughly 7,500 borrowers have an annual income between $100,000 and $120,000 and remaining 9,800 borrowers have annual income above $120,000. This shows that the distribution of the income in the $100,000+ bracket of the IncomeRange is very wide.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000
## [1] "Dimension of the data set with LoanOriginalAmount > 25000: "
## [1] 680  17

Most of the loans have an original amount of less than $12,000. The maximum loan amount is $35,000. The median is $6,500 and 680 loans have an amount above $25,000. I notice from the graph that most loan amounts seem to be rounded values and there are some distinct peaks at $3000, $5000, $10,000, $15000 and $20,000. I wonder if this is due to borrowers’ behaviour that they tend to go for rounded amounts to cover their financial requirements, instead of an exact amount to meet a specific requirement.

Univariate Analysis

What is the structure of your dataset?

The dataset contains information on about 114,000 loans issued by Prosper and includes loan and borrower data. The original dataset has 81 variables and I have taken a subset of 17 variables for the purporse of this analysis. The variable LoanRating is an ordered factor variable with the following levels:

(worst) —————————> (best)

LoanRating: NC, HR, E, D, C, B, A, AA

Some observations about the dataset:

  • Most of the loans have a maturity of 36 months.

  • The median CreditScoreRangeLower is 680

  • Most borrowers have between 7 and 13 current credit lines.

  • Most borrowers have DebtToIncomeRatio of less than 0.32 and the maximum is 10

  • Most borrowers have monthly income of less than $7,000 and the median is around 4,650. The maximum monthly income is $1,750,000

  • There is a sudden dip in number of loan originations in 2008.

What is/are the main feature(s) of interest in your dataset?

The main interesting feature of the dataset is the pricing of the loans and the assesment of credit rating (LoanRating). I would like to determine which features are best for predicting the BorrowerAPR of the loan. I think DebtToIncomeRatio, CreditScoreRangeLower and some other variables can be used to build a predictive model to price the Prosper loans.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

In addition to DebtToIncomeRatio and CreditScoreRangeLower, I think current credit lines and loan amount may also contribute to the pricing. I am also interested in checking if there is any change in relations around 2009 when the loan originations suddenly dipped and have increased steadily since.

Did you create any new variables from existing variables in the dataset?

I have created a new variable LoanRating by combining CreditGrade and ProsperRating (Alpha). CreditGrade covers loans from period pre-2009 and ProsperRating (Alpha) covers post 2009 period and so it makes sense to combine them for our analysis. I have also created new variables Year and Quarter by splitting the LoanOriginationQuarter variable.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

On plotting the chart for BorrowerAPR, I noticed 3 different pyramid structures. To check if this was driven by Term variable, with 3 values, I split the chart by Term but realised that all 3 charts have a similar chart pattern. The 3 peaks of the pyramids are not caused by Term. For some reason, there is a sudden rise in the number of loans with BorrowerAPR between 35 and 37. I also notice that there are sudden spikes in the histogram for EstimatedLoss. When I started the project, I thought that the loans with a higher expected chance of loss will have a higher borrower rate, but the histograms for these two do not look similar. The charts start to look similar when observed from left but there is sudden discontinuity in the distribution of EstimatedLoss for values above 0.10 with several peaks. I find this strange and unexpected. I have found a sudden dip in loan origination numbers in 2009, followed by a consistent increase. I am interested in understanding what impact this has on relationships amongst the variables.

I have created new variables LoanRating, Year and Quarter as mentioned above.

Bivariate Plots Section

Based on the Univariate analysis, I will create a subset of the previous data set with 13 variables.

## [1] "New dataset dimensions and variable names: "
## [1] 113937     13
##  [1] "BorrowerAPR"           "EstimatedLoss"        
##  [3] "LoanRating"            "Term"                 
##  [5] "ProsperScore"          "CreditScoreRangeLower"
##  [7] "CurrentCreditLines"    "TotalInquiries"       
##  [9] "DebtToIncomeRatio"     "StatedMonthlyIncome"  
## [11] "LoanOriginalAmount"    "Year"                 
## [13] "Quarter"

I will first create a scatterplot matrix using these variables. I will exclude Year and Quarter from this matrix as they do not represent the terms of the loans

EstimatedLoss correlates with BorrowerAPR which makes sense as borrowers with a higher expected loss should be charged a higher borrowing rate. ProsperScore and CreditScoreRangeLower show moderate correlation with BorrowerAPR. LoanOriginalAmount also shows a moderate correlation with BorrowerAPR but it is negative which I find surprising as higher the loan amount for a borrower greater is the risk and so a higher BorrowerAPR should apply.

EstimatedLoss shows a strong relationship with LoanRating and I would like to see how the relationship looks between LoanRating and BorrowerAPR.

It seems that the CurrentCreditLines, TotalInquiries, StatedMonthlyIncome and DebtToIncomeRatio do not have correlation with the BorrowerAPR. The median of the BorrowerAPR for each Term looks similar but it seems that BorrowerAPR has a wider interquartile range as Term increases.

I want to take a closer look at the scatter plots involving BorrowerAPR and other variables.

As EstimatedLoss increases, BorrowerAPR increases too. We can see some vertical lines where the BorrowerAPR differs significantly for the same EstimatedLoss. The relationship looks non-linear as the plot seems to flatten out as EstimatedLoss increases, especially after EstimatedLoss of 0.15.

## 
## Call:
## lm(formula = I(BorrowerAPR) ~ I(EstimatedLoss), data = subset(loans.sub.2, 
##     !is.na(BorrowerAPR)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45226 -0.01117 -0.00189  0.01238  0.19705 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.096346   0.000171   563.4   <2e-16 ***
## I(EstimatedLoss) 1.622699   0.001840   881.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02507 on 84851 degrees of freedom
##   (29059 observations deleted due to missingness)
## Multiple R-squared:  0.9016, Adjusted R-squared:  0.9016 
## F-statistic: 7.776e+05 on 1 and 84851 DF,  p-value: < 2.2e-16

Despite the fact that the relationship doesn’t look linear, based on the R^2 value, EstimatedLoss still explains 90% of the variance in BorrowerAPR.

We can see a clear relationship between ProsperScore and the median of BorrowerAPR. As ProsperScore gets low, the median of BorrowerAPR goes up and this is consistent with the negative correlation we saw above.

In this density plot, we see that the peaks move from low borrow rate (for Score 9) to high borrow rate (for Score 1)

## 
## Call:
## lm(formula = I(BorrowerAPR) ~ I(ProsperScore), data = subset(loans.sub.2, 
##     !is.na(BorrowerAPR)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.20222 -0.04307 -0.00692  0.03908  0.27641 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.340580   0.001852  183.86   <2e-16 ***
## I(ProsperScore)2  -0.032346   0.002005  -16.13   <2e-16 ***
## I(ProsperScore)3  -0.057305   0.001969  -29.11   <2e-16 ***
## I(ProsperScore)4  -0.081103   0.001924  -42.16   <2e-16 ***
## I(ProsperScore)5  -0.077726   0.001944  -39.99   <2e-16 ***
## I(ProsperScore)6  -0.101991   0.001926  -52.96   <2e-16 ***
## I(ProsperScore)7  -0.124963   0.001937  -64.51   <2e-16 ***
## I(ProsperScore)8  -0.162060   0.001927  -84.10   <2e-16 ***
## I(ProsperScore)9  -0.191603   0.001981  -96.73   <2e-16 ***
## I(ProsperScore)10 -0.225458   0.002037 -110.70   <2e-16 ***
## I(ProsperScore)11 -0.228745   0.002402  -95.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05834 on 84842 degrees of freedom
##   (29059 observations deleted due to missingness)
## Multiple R-squared:  0.4671, Adjusted R-squared:  0.467 
## F-statistic:  7437 on 10 and 84842 DF,  p-value: < 2.2e-16

The relationship is not linear but based on R^2 value, ProsperScore explains 46.7% of the variation in BorrowerAPR.

CreditScoreRangeLower offers an interesting relationship. There is no visible relationship for CreditScoreRangeLower values below 600 as the median of BorrowerAPR shows only small variation. There is a clear relationship for values greater than 600 as the median reduces with increasing CreditScoreRangeLower. This difference in relation explains why CreditScoreRangeLower was moderately and negatively correlated. This relation across all the values is not linear.

We can see several vertical lines and no general relationship. There is a wide variation in BorrowerAPR across most LoanOriginalAmount values. The variation reduces at certain points and it seems that the mid point of those variations also reduces which may be the reason why we get moderate and negative correlation for LoanOriginalAmount.

## 
## Call:
## lm(formula = I(BorrowerAPR) ~ I(LoanOriginalAmount), data = subset(loans.sub.2, 
##     !is.na(BorrowerAPR)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.23447 -0.05708  0.00010  0.05883  0.26533 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2.535e-01  3.759e-04   674.3   <2e-16 ***
## I(LoanOriginalAmount) -4.154e-06  3.608e-08  -115.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07606 on 113910 degrees of freedom
## Multiple R-squared:  0.1043, Adjusted R-squared:  0.1042 
## F-statistic: 1.326e+04 on 1 and 113910 DF,  p-value: < 2.2e-16

The scatterplot did not show a clear relationship but based on R^2 value, it seems the LoanOriginalAmount is able to explain roughly 10% of the variation in BorrowerAPR rate using a linear model.

This sounds counter-intuitive to me as a bigger loan amount should not be reducing the borrowing rate. I realise that the data set includes only the data for the loans that were approved and issued and so this data has a particular bias as it doesn’t include loans where the BorrowerAPR was high and the borrower refused to take the loan at that rate. I wonder if this is distorting the actual relationship since the borrowers can either be sensitive to borrow rate or have more options than Prosper for high loan amounts. We know from above analysis that a high EstimatedLoss implies a higher BorrowerAPR and a high EstimatedLoss is associated with an individual unable to pay back. The scatterplot matrix shows that the DebtToIncomeRatio has a moderate correlation to EstimatedLoss but not to BorrowerAPR. However, according to variable definitions, DebtToIncomeRatio is the debt to income ratio of the borrower at the time the credit profile is pulled and not the resulting ratio after the loan has been provided. I will create a new variable to represent this new ratio and run tests with it. I will divide the loan amount with the stated monthly income annualised (multiplied by 12) and then add this value to DebtToIncomeRatio. To remove errors due to extremely low monthly income values such as $1 or less, I will take the DebtToIncomeRatio as the new ratio for the bottom 2% of the stated monthly incomes cases.

## [1] "Summary of the new debt to income ratio: "
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.240   0.357   0.418   0.500  10.831    8554

## 
## Call:
## lm(formula = I(BorrowerAPR) ~ I(newratio), data = subset(loans.sub.2, 
##     !is.na(BorrowerAPR)))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.227366 -0.062144 -0.009422  0.060353  0.297288 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.2147422  0.0003048 704.553   <2e-16 ***
## I(newratio) 0.0035818  0.0004336   8.261   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07949 on 105356 degrees of freedom
##   (8554 observations deleted due to missingness)
## Multiple R-squared:  0.0006473,  Adjusted R-squared:  0.0006378 
## F-statistic: 68.24 on 1 and 105356 DF,  p-value: < 2.2e-16

The scatterplot shows no relationship between the new debt to income ratio and there is certainly no linear relationship either, based on R^2 value. I don’t have access to the data where the loans were not issued, especially for high loan amounts in order to establish whether there is any bias or not with the given data set. For the moment, I will need to assume that the analysis is restricted to loans that were issued and care should be taken before extending to a new loan valuation.

I wish to run density plot for LoanOriginalAmount and will need a factor variable. I am going to split the LoanOriginationAmount into 5 categories (A for lowest and E to highest) based on the placement of the value in 5 equal quartiles.

## [1] "Table for the new variable LoanAmtFactor: "
## 
##     A     B     C     D     E 
## 17815 27019 22807 22945 23351

The density plot shows that lower amounts of LoanOriginalAmount, symbolised by LoanAmtFactor of A, have most loans issued at high values of BorrowerAPR. For the highest loan amounts (Loan Factor E), the concentration of BorrowerAPR value is relatively low. As we go from A to E, the loans have a much more even distribution across BorrowerAPR.

The boxplot shows that the lower values of LoanAmtFactor have a higher median BorrowerAPR. The median reduces as LoanAmtFactor increases, that is, as the LoanOriginalAmount increases.This shows why the correlation for BorrowerAPR and LoanOriginalAmount was negative. This still does not explain why we see this behaviour and this may be due to the bias introduced by the sampling method of the loans.

We see a clear relation between LoanRating and BorrowerAPR as the BorrowerAPR median values increase with LoanRating, except for NC.

The density plot shows the relationship with BorrowerAPR more clearly. I notice clear separated peaks for LoanRating with peak at lowest BorrowerAPR for AA LoanRating and peaks moving to higher BorrowerAPR with increase in LoanRating.

## 
## Call:
## lm(formula = I(BorrowerAPR) ~ I(LoanRating), data = subset(loans.sub.2, 
##     !is.na(BorrowerAPR)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.31895 -0.01609  0.00239  0.02376  0.23531 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.0964053  0.0004059  237.51   <2e-16 ***
## I(LoanRating)A  0.0419086  0.0004965   84.41   <2e-16 ***
## I(LoanRating)B  0.0832982  0.0004878  170.78   <2e-16 ***
## I(LoanRating)C  0.1220308  0.0004750  256.90   <2e-16 ***
## I(LoanRating)D  0.1695741  0.0004898  346.18   <2e-16 ***
## I(LoanRating)E  0.2191068  0.0005257  416.75   <2e-16 ***
## I(LoanRating)HR 0.2311807  0.0005520  418.83   <2e-16 ***
## I(LoanRating)NC 0.1386092  0.0032556   42.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03822 on 113773 degrees of freedom
##   (131 observations deleted due to missingness)
## Multiple R-squared:  0.7737, Adjusted R-squared:  0.7737 
## F-statistic: 5.557e+04 on 7 and 113773 DF,  p-value: < 2.2e-16

LoanRating explains 77% of the variation in BorrowerAPR values based on R^2 value.

I want to check the relationship between LoanRating and EstimatedLoss, based on the scatterplot matrix results.

## loans.sub.2$LoanRating: AA
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.005   0.010   0.015   0.014   0.017   0.020    3509 
## -------------------------------------------------------- 
## loans.sub.2$LoanRating: A
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.020   0.025   0.032   0.031   0.037   0.040    3315 
## -------------------------------------------------------- 
## loans.sub.2$LoanRating: B
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.041   0.047   0.052   0.052   0.057   0.060    4389 
## -------------------------------------------------------- 
## loans.sub.2$LoanRating: C
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.061   0.067   0.077   0.076   0.085   0.090    5649 
## -------------------------------------------------------- 
## loans.sub.2$LoanRating: D
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.092   0.099   0.108   0.106   0.112   0.119    5153 
## -------------------------------------------------------- 
## loans.sub.2$LoanRating: E
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.122   0.132   0.142   0.140   0.147   0.149    3289 
## -------------------------------------------------------- 
## loans.sub.2$LoanRating: HR
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.152   0.162   0.165   0.173   0.170   0.366    3508 
## -------------------------------------------------------- 
## loans.sub.2$LoanRating: NC
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      NA      NA      NA     NaN      NA      NA     141

This is an interesting result. Each LoanRating value has non overlapping values of EstimatedLoss. Moreover, the maximum and minumum values for each rating are very close to the minimum value of higher rating and maximum value of lower rating respectively. This suggests that the LoanRating values are categorisation of EstimatedLoss.

## loans.sub.2$Term: 12
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.04935 0.14716 0.22189 0.21622 0.29167 0.35843 
## -------------------------------------------------------- 
## loans.sub.2$Term: 36
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00653 0.14857 0.20976 0.21943 0.29265 0.51229      25 
## -------------------------------------------------------- 
## loans.sub.2$Term: 60
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.07111 0.17184 0.20931 0.21684 0.25718 0.35838

The median of the BorrowerAPR for all the 3 terms is almost the same. The terms 12 and 36 have almost th same Inter Quartile Range while this is much smaller for Term of 60. The maximum value for Terms 12 and 60 is almost same as just under 0.36. This means that the big BorrowerAPR numbers with the third “pyramid” style peak are all from Term 36.

Given the high correlation between BorrowerAPR and EstimatedLoss, I want to see these plots with EstimateLoss too.

## loans.sub.2$Term: 12
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00490 0.03500 0.06240 0.07413 0.10800 0.14750       1 
## -------------------------------------------------------- 
## loans.sub.2$Term: 36
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.005   0.040   0.077   0.084   0.119   0.366   29083 
## -------------------------------------------------------- 
## loans.sub.2$Term: 60
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0049  0.0474  0.0650  0.0715  0.0899  0.1490

EstimatedLoss has a similar profile, except that its median for Term 36 is a bit higher than others.

## loans.sub.2$Year: 2005
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      NA      NA      NA     NaN      NA      NA      22 
## -------------------------------------------------------- 
## loans.sub.2$Year: 2006
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00653 0.14709 0.19884 0.20061 0.25757 0.51229       3 
## -------------------------------------------------------- 
## loans.sub.2$Year: 2007
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00653 0.13501 0.17722 0.18593 0.23662 0.37036 
## -------------------------------------------------------- 
## loans.sub.2$Year: 2008
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01315 0.13625 0.17791 0.20363 0.26214 0.41355 
## -------------------------------------------------------- 
## loans.sub.2$Year: 2009
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.05636 0.12511 0.21122 0.21594 0.28777 0.39951 
## -------------------------------------------------------- 
## loans.sub.2$Year: 2010
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.04583 0.13068 0.24999 0.23899 0.33097 0.42395 
## -------------------------------------------------------- 
## loans.sub.2$Year: 2011
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.05927 0.19108 0.29486 0.26230 0.34731 0.38723 
## -------------------------------------------------------- 
## loans.sub.2$Year: 2012
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.06587 0.18637 0.25781 0.25315 0.33553 0.35840 
## -------------------------------------------------------- 
## loans.sub.2$Year: 2013
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.06106 0.16732 0.20933 0.21395 0.25781 0.35797 
## -------------------------------------------------------- 
## loans.sub.2$Year: 2014
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.06726 0.14206 0.17710 0.18272 0.21648 0.34630

The median BorrowerAPR increased sharply after 2008 till 2011 and then it has consistently reduced till the last year of 2014 in our sample. Along with the median increase, the IQR also increased between 2008 and 2011, after which it collapsed suddenly. The IQR has been the lowest in 2013 and 2014. There looks to be a positive relation between median and IQR, split by year.

Earlier in the univariate plots, I had found several peaks in the BorrowerAPR histogram. The above plot shows that some of the peaks are prevalent in a particular year. For example, the peak at BorrowerAPR value above 35 is in year 2012. In 2013, we see a more balanced distribution. I wonder if this behaviour is a result of changes in business and customer behaviour or due to some changes in relationships between the variables.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

The BorrowerAPR rate correlates highly with the EstimatedLoss values, ProsperScore and LoanRating.

The Borrower’s interest rate goes up as the estimated loss value of the loan goes up. In the scatterplot of BorrowerAPR vs EstimatedLoss, there are several vertical bands representing data points where the BorrowerAPR rate varies with similar EstimatedLoss. Using a linear model, I find that EstimatedLoss explains roughly 90% of the variation in BorrowerAPR. The Borrower’s interest rate goes up as the quality of loan reduces, represented by Loan Rating. Based on a linear model, Loan Rating explains 77% of the variation in BorrowerAPR. ProsperScore has a negative relation with BorrowerAPR as the Borrower’s interest rate goes up with a lower ProsperScore. Through a linear model, I find that ProsperScore explains 46.7% of the variation in BorrowerAPR.

There are 2 more features, CreditScoreRangeLower and LoanOriginalAmount that also have a relationship with BorrowerAPR that isn’t so strong. As the CreditScoreRangeLower value goes high, the median of BorrowerAPR comes down but this relationship only holds true once the CreditScoreRangeLower value is above 600. The LoanOriginalAmount has a negative correlation with BorrowerAPR and explains 10% of the variation in the Borrower’s interest rate. However, this result does not look intuitive to me and so I doubt that there is a bias introduced in the results due to the given dataset.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

EstimatedLoss has a strong negative correlation with ProsperScore, CreditScoreRangeLower and LoanOriginalAmount. In addition, ProsperScore has a medium positive correlation with CreditScoreRangeLower. TotalInquires has a medium negative correlation with ProsperScore and CreditScoreRangeLower. The CurrentCreditLines has a medium positive correlation with CreditScoreRangeLower. RevolvingCreditBalance has a positive correlation with CurrentCreditLines. I have deduced that the LoanRating values are categorisation of the EstimatedLoss values and hence, should not be treated as a separate variable. BorrowerAPR values have a noticeably different distribution across the years. The reason behind this is not clear and could be dependent on customer behaviour or other factors that are outside the scope of this analysis. It is also possible that the relationships identified above have changed over time.

What was the strongest relationship you found?

The strongest relationship is between BorrowerAPR and EstimatedLoss. As per the linear model, based on R^2 values, EstimatedLoss explains 90% of the variation in the BorrowerAPR values. I found a strong relation between BorrowerAPR and LoanRating as well, but having also found that LoanRating is a categorisation of EstimatedLoss, I will not treat it as an independent variable.

Multivariate Plots Section

It is clear that EstimatedLoss is a key driver to the value of BorrowerAPR. In previous section, we observed in the scatter plot for BorrowerAPR vs EstimatedLoss that there were several vertical lines representing multiple values of BorrowerAPR for the same value of EstimatedLoss. I am interested in finding what drives these variations in values. In my next round of plots, I want to see how other factors that show relationship with BorrowerAPR (shown in Bivariate Plots analysis) relate to the ratio of BorrowerAPR and EstimatedLoss. I will create a new variable “APR_by_Loss” as the ratio of BorrowerAPR and EstimatedLoss.

## [1] "Summary of new variable APR_by_Loss"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.591   2.558   3.040   3.429   3.863  35.613   29084

We see that there is a reasonable divergence in the values for APR_by_Loss. It is this divergence that we want to explain using other variables.

## loans.sub.2$Term: 12
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   2.140   2.677   3.157   3.554   3.824  13.447       1 
## -------------------------------------------------------- 
## loans.sub.2$Term: 36
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.591   2.470   2.942   3.428   3.934  35.613   29083 
## -------------------------------------------------------- 
## loans.sub.2$Term: 60
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.080   2.676   3.091   3.424   3.788  24.492

We see that the vertical lines in the BorrowerAPR vs EstimatedLoss are not explained by Term as most seem to occur for a specific value of Term. We do notice that in case of Term of 60, the plot seems to diverge a bit more around the plots for Terms 12 and 36, ignoring the vertical lines. This tells me that there may be some variable causing the divergence.

## loans.sub.2$LoanAmtFactor: A
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.594   2.308   2.726   3.312   3.423  25.592    8861 
## -------------------------------------------------------- 
## loans.sub.2$LoanAmtFactor: B
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.594   2.228   2.454   2.853   2.996  20.177    6295 
## -------------------------------------------------------- 
## loans.sub.2$LoanAmtFactor: C
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.591   2.668   3.065   3.513   3.934  35.613    7259 
## -------------------------------------------------------- 
## loans.sub.2$LoanAmtFactor: D
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.645   2.729   3.253   3.595   3.999  24.492    3687 
## -------------------------------------------------------- 
## loans.sub.2$LoanAmtFactor: E
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.085   2.952   3.529   3.847   4.257  16.990    2982

The median of the APR_by_Loss increases as the LoanAmtFactor increases which implies a higher value of borrow rate for same Estimated Loss. This is the relationship I was expecting, instead of the negative correlation between the BorrowerAPR and LoanOriginalAmount. As a result, I believe that LoanAmtFactor is a better variable to study the relationship with BorrowerAPR.

The plots show that the vertical lines appear mainly for the lower values of the Loan amounts. I also notice that if I were to create a linear model for BorrowerAPR using EstimatedLoss then I will get a different slope for each value of LoanAmtFactor. The slope of the line increases as the loan amount increases. Based on the above analysis, I think LoanAmtFactor does have a non-random relationship with BorrowerAPR and should be considered when creating a model to predict BorrowerAPR.

## loans.sub.2$ProsperScore: 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.5911  2.1371  2.2801  2.1811  2.4052  4.2704 
## -------------------------------------------------------- 
## loans.sub.2$ProsperScore: 2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.8202  2.2062  2.3664  2.4134  2.5530  4.8544 
## -------------------------------------------------------- 
## loans.sub.2$ProsperScore: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.8193  2.2519  2.5039  2.5726  2.7871  5.4795 
## -------------------------------------------------------- 
## loans.sub.2$ProsperScore: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.139   2.390   2.676   2.769   3.050   5.702 
## -------------------------------------------------------- 
## loans.sub.2$ProsperScore: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.067   2.425   2.773   2.900   3.323   5.726 
## -------------------------------------------------------- 
## loans.sub.2$ProsperScore: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.347   2.640   2.952   3.167   3.638  12.909 
## -------------------------------------------------------- 
## loans.sub.2$ProsperScore: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.406   2.827   3.125   3.401   3.891  13.243 
## -------------------------------------------------------- 
## loans.sub.2$ProsperScore: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.185   3.166   3.727   3.987   4.615  10.589 
## -------------------------------------------------------- 
## loans.sub.2$ProsperScore: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.234   3.595   4.411   4.592   5.379  11.813 
## -------------------------------------------------------- 
## loans.sub.2$ProsperScore: 10
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.618   4.225   5.210   6.139   6.714  35.613 
## -------------------------------------------------------- 
## loans.sub.2$ProsperScore: 11
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.529   4.219   4.852   5.520   6.389   9.089

I realise that loans with lower ProsperScore have less occurence of low BorrowerAPR. These are also the loans with some of the highest EstimatedLoss values. The vertical lines in the scatter plots reduce as the ProsperScore increases but these are spread over a wide range of ProsperScore values. The slope of the linear plot increases as ProsperScore increases which is consistent with the increase in median of APR_by_Loss with increase in ProsperScore. I believe that ProsperScore certainly has a relation with BorrowerAPR and should be considered in creating a model.

Combining these results with those from LoanAmtFactor, I infer that loans with lower ProsperScore have loans of mainly low amounts, although, the causality cannot be established from this analysis.

## factor(loans.sub.2$CreditScoreRangeLower): 0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      NA      NA      NA     NaN      NA      NA     133 
## -------------------------------------------------------- 
## factor(loans.sub.2$CreditScoreRangeLower): 360
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      NA      NA      NA     NaN      NA      NA       1 
## -------------------------------------------------------- 
## factor(loans.sub.2$CreditScoreRangeLower): 420
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      NA      NA      NA     NaN      NA      NA       5 
## -------------------------------------------------------- 
## factor(loans.sub.2$CreditScoreRangeLower): 440
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      NA      NA      NA     NaN      NA      NA      36 
## -------------------------------------------------------- 
## factor(loans.sub.2$CreditScoreRangeLower): 460
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      NA      NA      NA     NaN      NA      NA     141 
## -------------------------------------------------------- 
## factor(loans.sub.2$CreditScoreRangeLower): 480
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      NA      NA      NA     NaN      NA      NA     346 
## -------------------------------------------------------- 
## factor(loans.sub.2$CreditScoreRangeLower): 500
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      NA      NA      NA     NaN      NA      NA     554 
## -------------------------------------------------------- 
## factor(loans.sub.2$CreditScoreRangeLower): 520
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      NA      NA      NA     NaN      NA      NA    1593 
## -------------------------------------------------------- 
## factor(loans.sub.2$CreditScoreRangeLower): 540
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      NA      NA      NA     NaN      NA      NA    1474 
## -------------------------------------------------------- 
## factor(loans.sub.2$CreditScoreRangeLower): 560
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      NA      NA      NA     NaN      NA      NA    1357 
## -------------------------------------------------------- 
## factor(loans.sub.2$CreditScoreRangeLower): 580
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      NA      NA      NA     NaN      NA      NA    1125 
## -------------------------------------------------------- 
## factor(loans.sub.2$CreditScoreRangeLower): 600
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.7058  2.2604  2.4014  2.3829  2.6632  4.9298    2562 
## -------------------------------------------------------- 
## factor(loans.sub.2$CreditScoreRangeLower): 620
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.7915  2.2801  2.4297  2.4708  2.6951  4.9298    2519 
## -------------------------------------------------------- 
## factor(loans.sub.2$CreditScoreRangeLower): 640
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.594   2.283   2.582   2.658   2.863   8.243    3350 
## -------------------------------------------------------- 
## factor(loans.sub.2$CreditScoreRangeLower): 660
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.5911  2.3970  2.7208  2.8591  3.1324  8.2455    2233 
## -------------------------------------------------------- 
## factor(loans.sub.2$CreditScoreRangeLower): 680
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.7328  2.4627  2.8595  3.0275  3.4292  8.2435    2473 
## -------------------------------------------------------- 
## factor(loans.sub.2$CreditScoreRangeLower): 700
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.7967  2.6327  3.0842  3.2969  3.7525 10.5952    1861 
## -------------------------------------------------------- 
## factor(loans.sub.2$CreditScoreRangeLower): 720
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.7507  2.7988  3.3651  3.6103  4.0705 18.9950    1889 
## -------------------------------------------------------- 
## factor(loans.sub.2$CreditScoreRangeLower): 740
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.8066  2.9787  3.6345  3.9519  4.5809 35.6133    1396 
## -------------------------------------------------------- 
## factor(loans.sub.2$CreditScoreRangeLower): 760
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.126   3.141   3.907   4.258   4.970  18.728    1353 
## -------------------------------------------------------- 
## factor(loans.sub.2$CreditScoreRangeLower): 780
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.001   3.590   4.552   5.021   5.726  24.492     919 
## -------------------------------------------------------- 
## factor(loans.sub.2$CreditScoreRangeLower): 800
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.550   3.873   4.943   5.532   6.496  22.603     537 
## -------------------------------------------------------- 
## factor(loans.sub.2$CreditScoreRangeLower): 820
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.8745  4.1537  5.3813  5.9067  6.8280 17.0567     367 
## -------------------------------------------------------- 
## factor(loans.sub.2$CreditScoreRangeLower): 840
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   2.203   4.444   5.861   6.400   7.708  23.033     169 
## -------------------------------------------------------- 
## factor(loans.sub.2$CreditScoreRangeLower): 860
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   2.203   4.552   5.921   6.464   7.837  16.718      90 
## -------------------------------------------------------- 
## factor(loans.sub.2$CreditScoreRangeLower): 880
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   2.724   5.761   6.828   6.725   7.710  13.084      10

The plots show that the vertical lines appear mainly for the lower values of the CreditScoreRangeLower. I also notice that if I were to create a linear model for BorrowerAPR using EstimatedLoss then I will get a different slope for each value of CreditScoreRangeLower. The slope of the line increases as the loan amount increases. Based on the above analysis, I think CreditScoreRangeLower does have a non-random relationship with BorrowerAPR and should be considered when creating a model to predict BorrowerAPR.

I notice that there are no data points for EstimatedLoss for years 2005 to 2008. This must be the loans that have N/A for EstimatedLoss due to expiry or other reasons that we found in the Univariate analysis.

I will create a new data set to remove these years. This will not have any adverse impact to the analysis as I need only data that has EstimatedLoss values. I will use the new data set going forward.

## [1] "Dimensions of the new data set: "
## [1] 84997    17

I will recreate the last plot with the new dataset.

The vertical lines on the BorrowerAPR vs EstimatedLoss scatterplots are present for years before (and including) 2010 only. This is an interesting finding as it implies that some relationships have changed at the end of 2010 to remove the vertical bands and provide an almost linear profile on these scatter plots.

## 
## Call:
## lm(formula = I(BorrowerAPR) ~ I(EstimatedLoss), data = subset(loans.sub.2.sub_year, 
##     !is.na(BorrowerAPR) & Year >= 2011))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.112996 -0.012286 -0.001311  0.011901  0.070112 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.0900705  0.0001321   681.6   <2e-16 ***
## I(EstimatedLoss) 1.7089812  0.0014547  1174.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01776 on 77296 degrees of freedom
## Multiple R-squared:  0.947,  Adjusted R-squared:  0.947 
## F-statistic: 1.38e+06 on 1 and 77296 DF,  p-value: < 2.2e-16

On running the linear model for BorrowerAPR vs EstimatedLoss using the data from and including 2011, I find that EstimatedLoss explains 94.7% of the variation in BorrowerAPR. This is an improvement over the 90% R-squared value observed in bivariate analysis for EstimatedLoss that included the data across all the years.

I will now run some tests to see what further relationships exist when the plots are split by year.

We can see that 2014 only has values for 1st quarter which explains the relatively low number of loan originations for 2014 in our data set. For 2013, I see that for same EstimatedLoss value, the BorrowerAPR rate is higher in Q1 vs Q4.

Looking at 2013 plot more closely shows the above observation clearly. This can be due to change in Prosper’s loan pricing method or particular short term concentration in a particular type of loan or customer behaviour. I am unable to analyse this due to the lack of data.

We see a consistent behaviour across all years for ProsperScore.

The behaviour here is also consistent across the years.

In the years 2009 and 2010, most of the loans had a low value for LoanAmtFactor with the scatterplot showing data across the chart and no relationships can be observed. From 2011 onwards, we see that lower LoanAmtFactor tend to have higher BorrowerAPR and EstimatedLoss. Also during these years, the slopes of the linear models for BorrowerAPR vs EstimatedLoss, split by LoanAmtFactor, diverge only slightly. LoanAmtFactor A has lowest slope while E has highest slope.

In these plots, I find that for 2009 and 2010, most of the loans have Term of 36 and so not much to observe on patterns. From 2011 onwards, there seems to be small difference in BorrowerAPR for different Terms. This behaviour has changed with time. In 2012, for same EstimatedLoss, Term 60 loans had highest BorrowerAPR while Term 12 loans had lowest BorrowerAPR. In 2013, Q1 and Q2 show same trend but this changes during Q3 and finally in Q4, Term 36 loans have highest BorrowerAPR while Term 60 has lowest BorrowerAPR.

The plots in this section show that a model can be build to predict BorrowerAPR using EstimatedLoss, Year, ProsperScore, Term, Quarter and LoanAmtFactor. I will develop a linear model with these variables for a subset of loans that originated after, and including, 2011. Though, we have seen that for period before 2011, EstimatedLoss can predict 90% of the BorrowerAPR variation but it seems clear that the BorrowerAPR behaviour changed from 2011, which removed the vertical bands in the BorrowerAPR vs EstimatedLoss scatterplot. Hence, different models should be created to reflect the change in behaviour. Here I am developing model for 2011 onwards as that it is a more recent period.

## 
## Calls:
## lm1: lm(formula = I(BorrowerAPR) ~ I(EstimatedLoss), data = loans.sub.2.sub_year.2011_plus)
## lm2: lm(formula = I(BorrowerAPR) ~ I(EstimatedLoss) + Year, data = loans.sub.2.sub_year.2011_plus)
## lm3: lm(formula = I(BorrowerAPR) ~ I(EstimatedLoss) + Year + ProsperScore, 
##     data = loans.sub.2.sub_year.2011_plus)
## lm4: lm(formula = I(BorrowerAPR) ~ I(EstimatedLoss) + Year + ProsperScore + 
##     Term, data = loans.sub.2.sub_year.2011_plus)
## lm5: lm(formula = I(BorrowerAPR) ~ I(EstimatedLoss) + Year + ProsperScore + 
##     Term + Quarter + LoanAmtFactor, data = loans.sub.2.sub_year.2011_plus)
## 
## ====================================================================================================
##                          lm1             lm2             lm3             lm4             lm5        
## ----------------------------------------------------------------------------------------------------
##   (Intercept)             0.090***        0.100***        0.111***        0.107***        0.105***  
##                          (0.000)         (0.000)         (0.001)         (0.001)         (0.001)    
##   I(EstimatedLoss)        1.709***        1.667***        1.554***        1.554***        1.568***  
##                          (0.001)         (0.001)         (0.002)         (0.002)         (0.002)    
##   Year: 2012/2011                         0.001**        -0.000*         -0.002***       -0.002***  
##                                          (0.000)         (0.000)         (0.000)         (0.000)    
##   Year: 2013/2011                        -0.008***       -0.012***       -0.014***       -0.013***  
##                                          (0.000)         (0.000)         (0.000)         (0.000)    
##   Year: 2014/2011                        -0.021***       -0.024***       -0.026***       -0.025***  
##                                          (0.000)         (0.000)         (0.000)         (0.000)    
##   ProsperScore: 2                                         0.006***        0.007***        0.007***  
##                                                          (0.001)         (0.001)         (0.001)    
##   ProsperScore: 3                                         0.005***        0.005***        0.005***  
##                                                          (0.001)         (0.001)         (0.001)    
##   ProsperScore: 4                                         0.004***        0.004***        0.003***  
##                                                          (0.001)         (0.001)         (0.001)    
##   ProsperScore: 5                                         0.006***        0.005***        0.005***  
##                                                          (0.001)         (0.001)         (0.001)    
##   ProsperScore: 6                                         0.006***        0.005***        0.005***  
##                                                          (0.001)         (0.001)         (0.001)    
##   ProsperScore: 7                                         0.004***        0.003***        0.003***  
##                                                          (0.001)         (0.001)         (0.001)    
##   ProsperScore: 8                                        -0.005***       -0.006***       -0.006***  
##                                                          (0.001)         (0.001)         (0.001)    
##   ProsperScore: 9                                        -0.013***       -0.014***       -0.014***  
##                                                          (0.001)         (0.001)         (0.001)    
##   ProsperScore: 10                                       -0.024***       -0.024***       -0.024***  
##                                                          (0.001)         (0.001)         (0.001)    
##   ProsperScore: 11                                       -0.020***       -0.020***       -0.018***  
##                                                          (0.001)         (0.001)         (0.001)    
##   Term: 36/12                                                             0.004***        0.003***  
##                                                                          (0.000)         (0.000)    
##   Term: 60/12                                                             0.011***        0.010***  
##                                                                          (0.000)         (0.000)    
##   Quarter: Q2/Q1                                                                          0.003***  
##                                                                                          (0.000)    
##   Quarter: Q3/Q1                                                                          0.003***  
##                                                                                          (0.000)    
##   Quarter: Q4/Q1                                                                         -0.003***  
##                                                                                          (0.000)    
##   LoanAmtFactor: B                                                                       -0.001***  
##                                                                                          (0.000)    
##   LoanAmtFactor: C                                                                        0.003***  
##                                                                                          (0.000)    
##   LoanAmtFactor: D                                                                        0.001**   
##                                                                                          (0.000)    
##   LoanAmtFactor: E                                                                        0.001*    
##                                                                                          (0.000)    
## ----------------------------------------------------------------------------------------------------
##   R-squared               0.947           0.955           0.963           0.965           0.966     
##   adj. R-squared          0.947           0.955           0.963           0.965           0.966     
##   sigma                   0.018           0.016           0.015           0.014           0.014     
##   F                 1380124.496      409751.704      142535.688      132011.249       95361.037     
##   p                       0.000           0.000           0.000           0.000           0.000     
##   Log-likelihood     201891.460      208212.173      215511.062      217628.613      219038.125     
##   Deviance               24.382          20.703          17.140          16.227          15.645     
##   AIC               -403776.920     -416412.345     -430990.123     -435221.226     -438026.249     
##   BIC               -403749.154     -416356.813     -430842.036     -435054.628     -437794.864     
##   N                   77298           77298           77298           77298           77298         
## ====================================================================================================

The variables in the linear model (lm5) account for 96.6% of the variation in BorrowerAPR. Even with just EstimatedLoss, the model accounts for 94.7% of the variation.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

In this Section, I realised that the year of origination has a significant role is understanding the relationships. The BorrowerAPR behaviour with EstimatedLoss changes from 2011 onwards. Over time, some relationships such as Term of loan to BorrowerAPR have also changed. Inclusion of year of origination has improved the understanding of the relationship between the variables.

Were there any interesting or surprising interactions between features?

I found the change in behaviour of BorrowerAPR to EstimatedLoss around 2010 interesting. The contribution of Term to BorrowerAPR also changes over time. Both these features were unexpected and raise my curiousity over the rationale behind them.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

Yes, I have created a linear model that can explain 96.6% of the variation in BorrowerAPR. I have been able to identify EstimatedLoss as the key variable that alone explains 94.7% of the variation. I have identified multiple variables that influence the relationship between BorrowerAPR and EstimatedLoss. The model has been developed for period from 2011 onwards and hence, should not be used for period before then. The relationships between the variables have changed over time and so it is possible that the model also have to be re-developed to account for the changes. Some relationships are not linear, such as that of the Year with other variables. Other models, beyond linear, should be explored to achieve greater accuracy.


Final Plots and Summary

Plot One

Description One

The distribution of the borrow rates seems to have multiple peaks. The peak at borrow rates above 35% seems to stand out. This could be perhaps due to the concentration of loans with similar features, such as Term.

Plot Two

Description Two

Loans with the highest amount have the lowest median borrow rate. Generally, loans with small amounts have a much higher median borrow rate. The density plot shows that the smallest size loans, symbolised by LoanAmtFactor of A, have most of its loan issued at high values of borrow rate. As we go from Loan Factor A to E, the loans have a much more even distribution across Borrower’s APR.

Plot Three

Description Three

The vertical lines on the Borrower’s APR vs Estimated Loss scatterplots are present for years before (and including) 2010 only. This implies that some relationships have changed at the end of 2010 to remove the vertical bands and provide an almost linear profile on these scatter plots. The linear profile implies that a linear model can be constructed to predict the borrow rate using Estimated Loss. Looking at 2013 closely, the relationship of Term with the linear relationship can be seen changing as we go from Q1 to Q4. This shows that Term should also be used in developing the model but its contribution is non-linear as it varies with other variables, Year and Quarter.


Reflection

The Prosper loans data set contains almost 114,000 loans with 81 variables. I started by creating 2 new variables and taking a subset of 17 variables to analyse. I investigated each variable to understand what they represent and their behaviour. I observed the relationships between the variables and more specifically with the borrow rate of the loans. During the process, I explored interesting questions and came across some unexpected results. Finally, I developed a linear model to predict the borrow rates for loans that originate from 2011 onwards.

There is a clear relationship between Estimated Loss and the borrow rate but there remained a fair amount of unexplained variations. As I explored other variables, I found some additional relationships to explain the variation in borrow rate. However, even with these variables, I struggled to explain the variation in borrow rate for same estimated loss that could be seen as vertical lines in BorrowerAPR and EstimatedLoss scatterplot. Earlier in my analysis, I had observed a sudden drop in loan origination in 2009 and then a consistent increase in origination over the years. I wondered if there is any changes in modelling or customer behaviour around 2009 that I should consider. I was very surprised to find that the unexplained variation, that is the vertical lines in the scatterplot were all coming from the years 2009 and 2010. This was unexpected and the data set is unable to provide a reason for this. I further explored the data using the variables I had identified earlier and analysed the relationships by year and quarter of loan origination. I was surprised to see change in Term’s relationship to the borrow rate over time and was able to see how it had gradually changed over the quarters in 2013. In the end, I have developed a linear model for loans orginating from 2011 onwards and can explain 96.6% variation in the borrow rate. I have used Estimated Loss, Year, Quarter, ProsperScore, LoanAmtFactor and Term to develop this model.

The linear model has some limitations. It has been developed for period from 2011 onwards and should not be used for earlier periods in the data set. The model may need to be re-developed if the relationship between variables changes as we have seen in the data set. Some relationships such as that of the Year with other variables are not linear. For future work, I would like to test non-Linear models to explain a greater percentage in borrow rate variation. I would also like to analyse the variables that were filtered out to find an explanation for the unexpected changes in the relationships observed in this analysis.